Predicting Credit Card Approval with ML
Predicting credit card approval using Logistic, Tree-based, and Neural classifiers with cost-aware learning.
This project was developed in collaboration with Paolo Caggiano for the Machine Learning course in the Master’s Degree in Data Science.
With the growing number of credit card users, it has become crucial for banks to distinguish between low-risk and high-risk customers.
Financial institutions, both public and private, rely on various client attributes such as personal information, income, expenses, and employment status to assess creditworthiness.
Effective analysis of these variables helps institutions avoid both operational and financial losses.
The objective of this research was to identify clients who, based on their financial history, are more likely to default.
Specifically, we aimed to determine which variables play the most significant role in predicting whether a loan will be repaid on time.
This goal was pursued through classification, a fundamental technique in machine learning.
To address this problem, we used a public dataset from the Kaggle platform.
After handling missing values and performing other preprocessing steps, we trained models using five different algorithms:
Logistic Regression, Random Forest, Multilayer Perceptron, Naive Bayes, and Naive Bayes Tree.
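
For concreteness, below is a minimal sketch of this setup in Python with scikit-learn. The imputation strategy and all hyperparameters are illustrative assumptions rather than the project's actual configuration, and Naive Bayes Tree (NBTree) is a Weka algorithm with no scikit-learn counterpart, so it appears only as a comment.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# Fill missing values before training; most-frequent imputation is an assumption.
imputer = SimpleImputer(strategy="most_frequent")

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Multilayer Perceptron": MLPClassifier(hidden_layer_sizes=(50,),
                                           max_iter=500, random_state=42),
    "Naive Bayes": GaussianNB(),
    # "Naive Bayes Tree": no scikit-learn implementation; see Weka's NBTree.
}
```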
Model performance was evaluated with K-fold cross-validation, which gives more robust performance estimates than a single train/test split.
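
As a hedged illustration of the protocol, the sketch below runs stratified K-fold cross-validation scored with F1 on stand-in data; the fold count of 10 is an assumption, since the write-up does not state K.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data: 200 clients, 8 numeric features, binary default label.
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = rng.integers(0, 2, 200)

# Stratified folds preserve the class ratio in every split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="f1")
print(f"mean F1 = {scores.mean():.3f} (std = {scores.std():.3f})")
```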
To improve the quality of predictions, we addressed class imbalance in the dataset by implementing both oversampling techniques and a cost matrix to reflect the asymmetric cost of misclassification.
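
The two strategies can be sketched as follows, assuming the imbalanced-learn package for oversampling and scikit-learn's `class_weight` as a simple stand-in for a full misclassification-cost matrix (frameworks such as Weka's CostSensitiveClassifier accept the matrix directly); the 5:1 penalty below is hypothetical.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Strategy 1: oversample the minority (defaulting) class inside a pipeline,
# so synthetic samples are generated only on the training folds.
oversampled = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Strategy 2: cost-sensitive learning. Misclassifying a defaulter (class 1)
# is penalized 5x more than misclassifying a good client; the ratio is a
# hypothetical placeholder for a bank-provided cost matrix.
cost_sensitive = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 5})
```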
We also applied feature selection, identifying the most informative variables and discarding irrelevant or redundant ones; this removes noise and speeds up model training.
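
A minimal sketch of this step on stand-in data follows; the choice of mutual information as the scoring function and of k=10 retained features are illustrative assumptions, in practice tuned via cross-validation.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stand-in data: 200 clients, 20 candidate features, binary default label.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = rng.integers(0, 2, 200)

# Keep the 10 features carrying the most mutual information with the label.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("Kept feature indices:", selector.get_support(indices=True))
```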
In conclusion, the F-score improvements observed across all five classifiers showed that both oversampling and cost-sensitive learning are effective strategies for handling imbalanced datasets.
In particular, cost-sensitive learning outperformed oversampling alone, and could be further enhanced by integrating real-world cost matrices provided by financial institutions.
Regarding the performance of individual algorithms:
- Logistic Regression and Naive Bayes performed consistently well across all evaluation settings.
- Random Forest and Naive Bayes Tree benefited significantly when combined with cost-sensitive learning.
- Multilayer Perceptron showed excellent performance when trained on all features with cost-sensitive learning, but suffered a notable drop in performance when combined with feature selection.